The purpose of this notebook is to demonstrate how pyLogit can replicate multinomial logit (MNL) models estimated with R's mlogit package, using the Train and Fishing datasets.
The models estimated in this notebook will be as follows:
ml.Train <- mlogit(choice ~ price + time + change + comfort | -1, Tr)
ml.Fish <- mlogit(mode ~ price | income | catch, Fishing, shape = "wide", varying = 2:9)
In [1]:
from collections import OrderedDict # For recording the model specification
import pandas as pd # For file input/output
import numpy as np # For vectorized math operations
import pylogit as pl # For MNL model estimation and
# conversion from wide to long format
In [2]:
# Load the Train data, noting that the data is in wide data format
wide_train_df = pd.read_csv("../data/train_data_r.csv")
# Load the Fishing data, noting that the data is in wide data format
wide_fishing_df = pd.read_csv("../data/fishing_data_r.csv")
In [3]:
# Look at the raw Train data
wide_train_df.head().T
Out[3]:
In [4]:
# Look at the raw Fishing data
wide_fishing_df.head().T
Out[4]:
Because the Train dataset is a panel dataset, and because our estimated MNL model will not account for the panel nature of the data, we need a new id column that identifies each individual choice situation.
Similarly, the Fishing data needs an observation id column, even though it is not a panel dataset. Every dataset used with pyLogit needs an "observation" id column that identifies the unit of observation being modeled.
In [5]:
# Note that we start the ids for the choice situations at 1.
wide_train_df["choice_situation"] = wide_train_df.index.values + 1
wide_fishing_df["observation"] = wide_fishing_df.index.values + 1
Since the columns denoting the choice in both the Train and the Fishing data are string objects, we need to convert them into integer-based columns.
In [6]:
# Convert the choice column for the Train data into integers
# Note that we will use a 1 to denote 'choice1' and a 2 to
# represent 'choice2'
wide_train_df["choice"] = wide_train_df["choice"].map({'choice1': 1,
                                                       'choice2': 2})
# Convert the "mode" column for the Fishing data into an
# integer based column. Use the following mapping:
mode_name_to_id = dict(zip(["beach", "pier", "boat", "charter"],
range(1, 5)))
wide_fishing_df["mode"] = wide_fishing_df["mode"].map(mode_name_to_id)
For both the Train and the Fishing data, all of the alternatives are available in every choice situation. Note that, in general, this is not the case for choice data. As such, we need columns that denote the availability of each alternative for each individual.
These columns will be filled entirely with ones in the wide format dataframes, because every alternative is always available to every individual.
In [7]:
# Create the needed availability columns for the Train data
# where each choice is a binary decision
for i in [1, 2]:
    wide_train_df["availability_{}".format(i)] = 1
# Create the needed availability columns for the Fishing data
# where each choice has four available alternatives
for i in range(1, 5):
    wide_fishing_df["availability_{}".format(i)] = 1
In [8]:
# Look at the columns that we need to account for when converting from
# the wide data format to the long data format.
wide_train_df.columns
Out[8]:
In [9]:
##########
# Define lists of the variables pertaining to each variable type
# that we need to account for in the data format transformation
##########
# Determine the name for the alternative ids in the long format
# data frame
train_alt_id = "alt_id"
# Determine the column that denotes the id of what we're treating
# as individual observations, i.e. the choice situations.
train_obs_id_col = "choice_situation"
# Determine what column denotes the choice that was made
train_choice_column = "choice"
# Create the list of observation specific variables
train_ind_variables = ["id", "choiceid"]
# Specify the variables that vary across individuals and some or all alternatives
# Note that each "main" key should be the desired name of the column in the long
# data format. The inner keys should be the alternative ids that have some
# value for the "main" key variable.
train_alt_varying_variables = {"price": {1: "price1",
2: "price2"},
"time": {1: "time1",
2: "time2"},
"change": {1: "change1",
2: "change2"},
"comfort": {1: "comfort1",
2: "comfort2"}
}
# Specify the availability variables
train_availability_variables = OrderedDict()
for alt_id, var in zip([1, 2], ["availability_1", "availability_2"]):
    train_availability_variables[alt_id] = var
In [10]:
##########
# Actually perform the conversion to long format
##########
long_train_df = pl.convert_wide_to_long(wide_data=wide_train_df,
ind_vars=train_ind_variables,
alt_specific_vars=train_alt_varying_variables,
availability_vars=train_availability_variables,
obs_id_col=train_obs_id_col,
choice_col=train_choice_column,
new_alt_id_name=train_alt_id)
# Look at the long format train data
long_train_df.head()
Out[10]:
In [11]:
# Look at the columns that we need to account for when converting from
# the wide data format to the long data format.
wide_fishing_df.columns
Out[11]:
In [12]:
##########
# Define lists of the variables pertaining to each variable type
# that we need to account for in the data format transformation
##########
# Determine the name for the alternative ids in the long format
# data frame
fishing_alt_id = "alt_id"
# Determine the column that denotes the id of what we're treating
# as individual observations, i.e. the choice situations.
fishing_obs_id_col = "observation"
# Determine what column denotes the choice that was made
fishing_choice_column = "mode"
# Create the list of observation specific variables
fishing_ind_variables = ["income"]
# Specify the variables that vary across individuals and some or all alternatives
# Note that each "main" key should be the desired name of the column in the long
# data format. The inner keys should be the alternative ids that have some
# value for the "main" key variable.
fishing_alt_varying_variables = {"price": {1: "price.beach",
2: "price.pier",
3: "price.boat",
4: "price.charter"},
"catch": {1: "catch.beach",
2: "catch.pier",
3: "catch.boat",
4: "catch.charter"},
}
# Specify the availability variables
fishing_availability_variables = OrderedDict()
for alt_id, var in zip(range(1, 5), ["availability_{}".format(x) for x in range(1, 5)]):
    fishing_availability_variables[alt_id] = var
In [13]:
##########
# Actually perform the conversion to long format
##########
long_fishing_df = pl.convert_wide_to_long(wide_data=wide_fishing_df,
ind_vars=fishing_ind_variables,
alt_specific_vars=fishing_alt_varying_variables,
availability_vars=fishing_availability_variables,
obs_id_col=fishing_obs_id_col,
choice_col=fishing_choice_column,
new_alt_id_name=fishing_alt_id)
# Look at the long format fishing data
long_fishing_df.head()
Out[13]:
In [14]:
# For the Train data, scale the price and time variables so the units
# are meaningful, namely hours and euros.
long_train_df["price_euros"] = long_train_df["price"] / 100.0 * 2.20371
long_train_df["time_hours"] = long_train_df["time"] / 60.0
For reasons of numerical stability, it is advisable to scale one's variables so that the estimated coefficients are similar in absolute magnitude and, where possible, close to 1 in absolute value (in other words, neither terribly tiny nor extremely large). This is done for the Fishing data below.
In [15]:
# Scale the income data
long_fishing_df["income_thousandth"] = long_fishing_df["income"] / 1000.0
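Such rescaling is harmless because, in a linear-in-parameters utility, dividing a variable by a constant simply multiplies its estimated coefficient by that constant, leaving every systematic utility contribution unchanged. A minimal sketch with made-up numbers (the coefficient and income value below are purely illustrative):

```python
# Hypothetical coefficient on raw income and a hypothetical income value.
beta_income = -1.355e-4
income = 42000.0

# After dividing income by 1000, the coefficient absorbs the factor of 1000 ...
beta_scaled = beta_income * 1000.0

# ... so the contribution to the systematic utility is identical.
raw_contribution = beta_income * income
scaled_contribution = beta_scaled * (income / 1000.0)
print(abs(raw_contribution - scaled_contribution) < 1e-12)
```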
Note that the Train dataset is a stated-preference dataset with unlabeled alternatives. Because the unobserved factors affecting a person's choice are assumed to be the same for both alternatives, the error terms are expected to have the same mean for both alternatives. The two alternative-specific constants (ASCs) would therefore be equal, and their difference, which is the only quantity that can be identified, would be zero. Because of this, no ASCs are estimated for the Train model.
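The identification argument above can be illustrated numerically: in an MNL model, adding the same constant to every alternative's utility leaves the choice probabilities unchanged, so only differences between ASCs are identified, and a zero difference means the ASCs can be dropped entirely. A small sketch with made-up utility values:

```python
import math

def mnl_probs(utilities):
    """Softmax choice probabilities for an MNL model."""
    exps = [math.exp(u) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical systematic utilities for the two train alternatives.
base = mnl_probs([1.2, 0.7])
# Adding the same ASC (here 5.0) to both alternatives ...
shifted = mnl_probs([1.2 + 5.0, 0.7 + 5.0])
# ... leaves the choice probabilities unchanged.
print(all(abs(p - q) < 1e-12 for p, q in zip(base, shifted)))
```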
In [16]:
# Look at the columns available for use in specifying the model
long_train_df.columns
Out[16]:
In [17]:
# Create the model specification
train_spec = OrderedDict()
train_names = OrderedDict()
# Note that for the specification dictionary, the
# keys should be the column names from the long format
# dataframe and the values should be a list with a combination
# of alternative id's and/or lists of alternative id's. There
# should be one element for each beta that will be estimated
# in relation to the given column. Lists of alternative id's
# mean that all of the alternatives in the list will get a
# single beta for them, for the given variable.
# The names dictionary should contain one name for each
# element (that is each alternative id or list of alternative
# ids) in the specification dictionary value for the same
# variable
for col, display_name in [("price_euros", "price"),
("time_hours", "time"),
("change", "change"),
("comfort", "comfort")]:
    train_spec[col] = [[1, 2]]
    train_names[col] = [display_name]
In [18]:
# Create an instance of the MNL model class
train_model = pl.create_choice_model(data=long_train_df,
alt_id_col=train_alt_id,
obs_id_col=train_obs_id_col,
choice_col=train_choice_column,
specification=train_spec,
model_type="MNL",
names=train_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
train_model.fit_mle(np.zeros(4))
# Look at the estimation summaries
train_model.get_statsmodels_summary()
Out[18]:
Look at the corresponding results from mlogit:
Call:
mlogit(formula = choice ~ price + time + change + comfort | -1, data = Tr,
    method = "nr", print.level = 0)

Frequencies of alternatives:
      1       2
0.50324 0.49676

nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 0.00014
successive function values within tolerance limits

Coefficients :
          Estimate Std. Error  t-value  Pr(>|t|)
price   -0.0673580  0.0033933 -19.8506 < 2.2e-16 ***
time    -1.7205514  0.1603517 -10.7299 < 2.2e-16 ***
change  -0.3263409  0.0594892  -5.4857 4.118e-08 ***
comfort -0.9457256  0.0649455 -14.5618 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -1724.2
In terms of differences between the estimation output of mlogit and my estimation output, the differences seem to lie mainly in the p-values. My p-values are calculated with respect to an asymptotic normal distribution, whereas the p-values of mlogit are based on the t-distribution; this accounts for the p-value difference.
There is a very slight difference between mlogit's value for the time parameter and my own time parameter estimate, but this may simply be due to the convergence criteria that each of the packages uses.
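The normal-versus-t distinction can be checked by hand. The sketch below recomputes the two-sided, normal-based p-value for the price row using only the Python standard library, via the identity P(|Z| > |z|) = erfc(|z| / sqrt(2)); the estimate and standard error are copied from the summaries above. (A t-based p-value like mlogit's would instead use the t-distribution's survival function, e.g. scipy.stats.t.sf, with the model's residual degrees of freedom.)

```python
import math

# Estimate and standard error for "price", copied from the summaries above.
est, se = -0.0673580, 0.0033933

# The z-statistic; this matches mlogit's reported t-value up to rounding.
z = est / se

# Two-sided p-value under the asymptotic normal distribution:
# P(|Z| > |z|) = erfc(|z| / sqrt(2))
p_normal = math.erfc(abs(z) / math.sqrt(2))

print("z = {:.4f}, p = {:.3e}".format(z, p_normal))
```

With a statistic this large the p-value is vanishingly small under either distribution, which is why the reported significance conclusions agree even though the exact p-values differ.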
In [19]:
# Look at the columns available for use in specifying the model
long_fishing_df.columns
Out[19]:
In [20]:
# Create the model specification
fishing_spec = OrderedDict()
fishing_names = OrderedDict()
# Note that for the specification dictionary, the
# keys should be the column names from the long format
# dataframe and the values should be a list with a combination
# of alternative id's and/or lists of alternative id's. There
# should be one element for each beta that will be estimated
# in relation to the given column. Lists of alternative id's
# mean that all of the alternatives in the list will get a
# single beta for them, for the given variable.
# The names dictionary should contain one name for each
# element (that is each alternative id or list of alternative
# ids) in the specification dictionary value for the same
# variable
# Note the intercept for beach is constrained to zero for identification
fishing_spec["intercept"] = list(range(2, 5))
fishing_names["intercept"] = ["ASC: pier",
"ASC: boat",
"ASC: charter"]
fishing_spec["price"] = [[1, 2, 3, 4]]
fishing_names["price"] = ["price"]
# Note the income coefficient for beach is constrained to zero for identification
# Note also that we use the scaled variables because they numerically perform better
# fishing_spec["income"] = range(2, 5)
# fishing_names["income"] = ["income_{}".format(x)
# for x in ["pier", "boat", "charter"]]
fishing_spec["income_thousandth"] = list(range(2, 5))
fishing_names["income_thousandth"] = ["income_{} / 1000".format(x)
                                      for x in ["pier", "boat", "charter"]]
fishing_spec["catch"] = list(range(1, 5))
fishing_names["catch"] = ["catch_{}".format(x) for x in
                          ["beach", "pier", "boat", "charter"]]
In [21]:
# Create an instance of the MNL model class
fishing_model = pl.create_choice_model(data=long_fishing_df,
alt_id_col=fishing_alt_id,
obs_id_col=fishing_obs_id_col,
choice_col=fishing_choice_column,
specification=fishing_spec,
model_type="MNL",
names=fishing_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
fishing_model.fit_mle(np.zeros(11))
# Look at the estimation summaries
fishing_model.get_statsmodels_summary()
Out[21]:
Look at the results from mlogit:
Call:
mlogit(formula = mode ~ price | income | catch, data = Fishing, shape = "wide",
    varying = 2:9, method = "nr", print.level = 0)

Frequencies of alternatives:
  beach    boat charter    pier
0.11337 0.35364 0.38240 0.15059

nr method
7 iterations, 0h:0m:0s
g'(-H)^-1g = 2.54E-05
successive function values within tolerance limits

Coefficients :
                       Estimate  Std. Error  t-value  Pr(>|t|)
boat:(intercept)     8.4184e-01  2.9996e-01   2.8065 0.0050080 **
charter:(intercept)  2.1549e+00  2.9746e-01   7.2443 4.348e-13 ***
pier:(intercept)     1.0430e+00  2.9535e-01   3.5315 0.0004132 ***
price               -2.5281e-02  1.7551e-03 -14.4046 < 2.2e-16 ***
boat:income          5.5428e-05  5.2130e-05   1.0633 0.2876612
charter:income      -7.2337e-05  5.2557e-05  -1.3764 0.1687088
pier:income         -1.3550e-04  5.1172e-05  -2.6480 0.0080977 **
beach:catch          3.1177e+00  7.1305e-01   4.3724 1.229e-05 ***
boat:catch           2.5425e+00  5.2274e-01   4.8638 1.152e-06 ***
charter:catch        7.5949e-01  1.5420e-01   4.9254 8.417e-07 ***
pier:catch           2.8512e+00  7.7464e-01   3.6807 0.0002326 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -1199.1
McFadden R^2: 0.19936
Likelihood ratio test : chisq = 597.16 (p.value = < 2.22e-16)
The coefficient estimates, standard errors, and t-values are exactly equal. The p-value differences arise, as already noted, because mlogit uses the t-distribution whereas pyLogit uses an asymptotic normal distribution. The log-likelihood of our final model is also the same.
Note that the McFadden R^2 values are different. I am not sure how the mlogit value is calculated. From "Coefficients of Determination for Multiple Logistic Regression Analysis" by Scott Menard (2000), The American Statistician, 54:1, 17-24, we have the following equation:

R^2_McFadden = 1 - (LL_model / LL_null)
In [22]:
##########
# Make sure that pyLogit's null log-likelihood
# and McFadden's R^2 are correct
##########
# Note that every observation in the Fishing dataset
# has 4 available alternatives, therefore the null
# probability is 0.25
null_prob = 0.25
# Calculate how many observations are in the Fishing
# dataset
num_fishing_obs = wide_fishing_df.shape[0]
# Calculate the Fishing dataset's null log-likelihood
null_fishing_log_likelihood = (num_fishing_obs *
np.log(null_prob))
# Determine whether pyLogit's null log-likelihood is correct
correct_null_ll = (null_fishing_log_likelihood ==
fishing_model.null_log_likelihood)
print("pyLogit's null log-likelihood is correct:", correct_null_ll)
# Calculate McFadden's R^2
mcfaddens_r2 = 1 - (fishing_model.log_likelihood / fishing_model.null_log_likelihood)
print("McFadden's R^2 is {:.5f}".format(mcfaddens_r2))